Predicting gene function in S. cerevisiae and A. thaliana using hierarchical multi-label decision tree ensembles
Abstract
Motivation: S. cerevisiae and A. thaliana are two well-studied organisms in biology. Although their genomes were completed in 1996 and 2000, respectively, the functions of 30% to 40% of their open reading frames (ORFs) remain unclassified. Different machine learning methods have been proposed that annotate the ORFs automatically. However, it is unclear which method is to be preferred in terms of predictive performance, efficiency, interpretability, and usability. Moreover, different evaluation measures for predictive performance have been used in the literature, each showing only a limited aspect of a method's performance.
Results: We study the usefulness of decision tree based models for predicting the multiple functions of ORFs. First, we describe an algorithm for learning decision trees that can make predictions for the ORFs automatically. We present new results obtained with this algorithm, showing that the trees it finds exhibit clearly better predictive performance than the trees found by previously described methods, while yielding equally interpretable results. The predictive accuracy of our trees, however, is still below that of some recently proposed statistical learning methods. Ensembles of such trees, on the other hand, give even better predictive results, comparable with those of state-of-the-art methods (sometimes better, sometimes worse), while the ensemble method scales much better and is easier to use. We conclude that decision tree based methods are currently the most efficient, easy-to-use, and flexible approach to ORF function prediction, flexible in the sense that they cover the spectrum from maximally interpretable to maximally accurate models. Our evaluation makes use of precision-recall curves. We argue that this is a better evaluation criterion than previously used criteria. Our evaluation method can be seen as an additional contribution to the field.
Availability: The software is freely available at http://www.cs.kuleuven.be/~dtai/clus/.
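As an illustration of the kind of evaluation the abstract refers to, the following is a minimal sketch of how precision-recall points can be computed for multi-label predictions by sweeping a decision threshold. It is not taken from the paper or from the CLUS software: the function name `pooled_pr_curve`, the pooling of all (ORF, class) pairs into one count, and the toy scores and labels are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def pooled_pr_curve(
    scores: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) triples.

    Counts are pooled over every (ORF, class) pair, which is one possible
    way to summarise multi-label predictions (an assumption of this sketch).
    """
    curve = []
    for t in thresholds:
        tp = fp = fn = 0
        for score_row, label_row in zip(scores, labels):
            for s, y in zip(score_row, label_row):
                predicted = s >= t  # predict the class when its score reaches t
                if predicted and y == 1:
                    tp += 1
                elif predicted and y == 0:
                    fp += 1
                elif not predicted and y == 1:
                    fn += 1
        # Precision is undefined when nothing is predicted; use 1.0 by convention.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        curve.append((t, precision, recall))
    return curve


# Hypothetical toy data: 3 ORFs (rows) and 3 functional classes (columns).
toy_scores = [[0.9, 0.2, 0.6], [0.4, 0.8, 0.1], [0.7, 0.3, 0.5]]
toy_labels = [[1, 0, 1], [0, 1, 0], [1, 0, 0]]
toy_curve = pooled_pr_curve(toy_scores, toy_labels, [0.25, 0.5, 0.75])
for threshold, precision, recall in toy_curve:
    print(f"t={threshold:.2f} precision={precision:.3f} recall={recall:.3f}")
```

Lowering the threshold trades precision for recall, so the full set of points describes a model across all operating points rather than at a single fixed cutoff.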
Related resources
Evaluation of Distance Measures for Hierarchical Multi-label Classification in Functional Genomics
Hierarchical multi-label classification (HMLC) is a variant of classification where instances may belong to multiple classes that are organized in a hierarchy. The approach we used is based on decision trees and is set in the predictive clustering trees framework (PCTs), which is implemented in the CLUS system. In this work, we are investigating how different distance measures for hierarchies i...
Tree ensembles for predicting structured outputs
In this paper, we address the task of learning models for predicting structured outputs. We consider both global and local predictions of structured outputs, the former based on a single model that predicts the entire output structure and the latter based on a collection of models, each predicting a component of the output structure. We use ensemble methods and apply them in the context of pred...
Evaluating the performance of the decision tree model in estimating river suspended sediment (case study: Ilam dam basin)
Accurate estimation of the volume of sediment carried by rivers is very important in water projects. Finding the most suitable ways to calculate sediment discharge has therefore been the objective of most research projects. Among these methods are machine learning methods such as decision tree models (which are based on the principles of learning). Decision...
New Ant Colony Optimisation Algorithms for Hierarchical Classification of Protein Functions
Ant colony optimisation (ACO) is a metaheuristic to solve optimisation problems inspired by the foraging behaviour of ant colonies. It has been successfully applied to several types of optimisation problems, such as scheduling and routing, and more recently for the discovery of classification rules. The classification task in data mining aims at predicting the value of a given goal attribute fo...
An Extensive Evaluation of Decision Tree-Based Hierarchical Multi-Label Classification Methods and Performance Measures
Hierarchical Multi-Label Classification is a complex classification problem where an instance can be assigned to more than one class simultaneously, and these classes are hierarchically organized with superclasses and subclasses, i.e., an instance can be classified as belonging to more than one path in the hierarchical structure. This article experimentally analyses the behaviour of different d...
Publication date: 2008